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Abstract 

Mutation rate variation across loci is well known to cause difficulties, no- 
tably identifiability issues, in the reconstruction of evolutionary trees from 
molecular sequences. Here we introduce a new approach for estimating gen- 
eral rates-across-sites models. Our results imply, in particular, that large 
phylogenies are typically identifiable under rate variation. We also derive 
sequence-length requirements for high-probability reconstruction. 

Our main contribution is a novel algorithm that clusters sites accord- 
ing to their mutation rate. Following this site clustering step, standard re- 
construction techniques can be used to recover the phylogeny. Our results 
rely on a basic insight: that, for large trees, certain site statistics experience 
concentration-of-measure phenomena. 
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1 Introduction 



The evolutionary history of living organisms is typically represented graphically 
by a phylogeny, a tree whose branchings indicate past speciation events. The infer- 
ence of phylogenies based on molecular sequences extracted from extant species 
is a major task of computational biology. Among the many biological phenomena 
that complicate this task, one that has received much attention in the statistical 
phylogenetics literature is the variation in mutation rate across sites in a genome. 
(See related work below.) Such variation is generally attributed to unequal de- 
grees of selective pressure. As we describe formally below, mathematically this 
phenomenon can be modeled as a mixture of phylogenies. That is, interpreting 
branch length as a measure of the amount of evolutionary change, rates-across- 
sites (RAS) models posit that all sites in a genome evolve according to a common 
tree topology, but branch lengths for a given site are scaled by a random factor. 

Here we introduce a new approach for estimating RAS models. Our main con- 
tribution is a novel algorithm which clusters the sites according to their mutation 
rate. We show that our technique may be used to reconstruct phylogenies. In- 
deed, following the site clustering step, standard reconstruction techniques can be 
employed to recover a phylogeny on the unmixed subset of sites obtained. Our re- 
sults rely on the following basic insight: there exist simple site-wise statistics that 
experience concentration-of-measure phenomena. Consequently, our techniques 
only hold in the large-tree limit. 

Concentration has been used extensively in statistical phylogenetics. However 
its t5^ical use is in the large-sample limit, that is, as the sequence length grows to 
infinity, for instance in order to show that so-called evolutionary distance estimates 
are accurate given sufficiently long sequences (see e.g. [ESSW99]). Instead, we 
consider here concentration in what we call the large-tree limit, that is, as the 
number of leaves goes to infinity. Note that the latter is trickier to analyze. Indeed, 
whereas different sites are usually assumed to evolve independently, leaf states are 
not independent. To the best of our knowledge, this is the first use of this type of 
concentration in the context of phylogenetics. 

Our results imply, in particular, that large phylogenies are typically identifi- 
able under rate variation. We also derive sequence-length requirements for high- 
probability reconstruction. 
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1.1 Related work 



Most prior theoretical work on mixture models has focused on the question of 
identifiability. A class of phylogenetic models is identifiable if any two models 
in the class produce different data distributions. It is well-known that unmixed 
phylogenetic models are typically identifiable [Cha96]. This is not the case in 
general for mixtures of phylogenies. For instance, Steel et al. [SSH94] showed 
that for any two trees one can find a random scaling on each of them such that 
their data distributions are identical. Hence it is hopeless in general to reconstruct 
phylogenies under mixture models. See also [EW04, MS07, MMS08, SVOVb, 
SVOVa, Ste09] for further examples of this type. 

However the negative examples constructed in the references above are not 
necessarily typical. They use special features of the mutation models (and their 
invariants) and allow themselves quite a bit of flexibiUty in setting up the topolo- 
gies and branch lengths. In fact, recently a variety of more standard mixture mod- 
els have been shown to be identifiable. These include the common GTR+Gamma 
model [AAR08, WSIO] and GTR+Gamma+I model [CHU], as well as some co- 
varion models [AR06], some group-based models [APRS 11], and so-called r- 
component identical tree mixtures [RSIO]. Although these results do not provide 
practical algorithms for reconstructing the corresponding mixtures, they do give 
hope that these problems may be tackled successfully. 

Beyond the identifiability question, there seems to have been little rigorous 
work on reconstructing phylogenetic mixture models. One positive result is the 
case of the molecular clock assumption with across-sites rate variation [SSH94], 
although no sequence-length requirements are provided. There is a large body of 
work on practical reconstruction algorithms for various types of mixtures, notably 
rates-across-sites models and covarion-type models, using mostly likelihood and 
bayesian methods. See e.g. [Fel04] for references. But the optimization prob- 
lems they attempt to solve are likely NP-hard [CT06, Roc06]. There also exist 
many techniques for testing for the presence of a mixture (for example, for test- 
ing for rate heterogeneity), but such tests typically require the knowledge of the 
phylogeny. See e.g. [HR97]. 

Here we give both identifiability and reconstruction results. Whereas Steel 
et al. [SSH94] show that any two fixed trees can be made indistinguishable with 
an appropriate (arbitrarily complex) choice of scaling distributions, we show in 
essence that, given a fixed rate distribution (or a well-behaved class of rate dis- 
tributions), sufficiently large trees are typically distinguishable. After a draft of 
our results were circulated [MR08], related results for large trees were established 
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by Rhodes and SuUivant [RSIO] using different techniques. In particular, our 

technical assumptions are similar in spirit to the genericity condition in [RSIO]. 
Although our genericity assumptions are stronger, they allow an efficient recon- 
struction of the model and explicit bounds on sequence-length requirements. Note 
moreover that our results apply to general, possibly continuous, nonparametric 
rate distributions. 

The proof of our main results relies on the construction of a site clustering 
statistic that discriminates between different rates. A similar statistic was also 
used in [SS06] in a different context. However, in contrast to [SS06], our main 
reconstruction result requires that a site clustering statistic be constructed based 
only on data generated by the mixture — ^that is, without prior knowledge of the 
model. 

1.2 Overview of techniques 

A simplified setting To illustrate our main ideas, we first consider a simple two- 
speed model. Assume that molecular sequences have two types of sites: "slow" 
and "fast." Both types of sites evolve independently by single substitution on a 
common evolutionary tree according, say, to a standard Jukes-Cantor model of 
substitution, but the fast ones evolve three times as fast. See Section 2 for a 
formal definition of the Jukes-Cantor model. To keep things simple, assume for 
now that the evolutionary tree is a complete binary tree with n = 2^ leaves, where 
h is the number of levels. (Note that our results apply to much more general rate 
distributions. We also discuss how to deal with general trees. See below.) 

Our approach is based on the following question: Is it possible to tell with high 
confidence which sites are slow or fast, with no prior knowledge of the phylogeny 
that generated them? Perhaps surprisingly, the answer is yes — at least for large 
trees. This far-reaching observation does not seem to have been made previously. 

To see how this works, assume for the time being that we know the phylogeny. 
We will show how to remove this assumption below. Take a pair of leaves a, b. 
The effect of the speed of a site can be seen in the probability of agreement be- 
tween a and b: the leaves agree more often on slow-evolving sites. Hence, if a site 
shows agreement between a and b, one may deduce that the site is more likely to 
be slow-evolving. But this is too little information to infer with high confidence 
the speed of a site. Instead, one may look at a larger collection of pairs of leaves 
and consider the statistic that counts how many of them agree on a given site. The 
idea is that a large number of agreements should indicate a slow site. For this 
scheme to work accurately, we require two properties from this statistic: separa- 
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tion and concentration. By separation, we mean that the expected value of the 

statistic should be different on slow and fast sites — in order to distinguish them. 
By concentration, we mean that the statistic should lie very close to its expecta- 
tion. These two properties produce a good site clustering test. To satisfy them, the 
pairs of leaves involved must be chosen carefully. 

Separation and concentration To obtain separation, it is natural to use only 
pairs of "close" leaves. Indeed, leaves that are far away are practically indepen- 
dent and the speed of a site has very little noticeable effect on their agreement. 
As for concentration, what one needs is the kind of conditions that give rise to the 
central Umit theorem: a large sum of small independent contributions. For sym- 
metric models such as the Jukes-Cantor model, the agreement events on two pairs 
of leaves (a, b) and (c, d) are independent as long as the paths between [a, h) and 
(c, d) do not intersect. Therefore, we are led to consider the following statistic: 
count how many cherries (that is, sister leaves) agree and divide by the total num- 
ber of cherries to obtain a fraction. One can show from the considerations above 
that such a statistic is highly concentrated. 

Unknown, general tree However our derivation so far has relied heavily on two 
unsatisfied premises: 

1. That the tree is known. This is of course not the case since our ultimate goal 
is precisely to reconstruct the phylogeny. 

2. And that the tree is complete. In particular, our argument uses the fact that 
complete binary trees contain many cherries. But general trees may have 
very few cherries. 

Perhaps surprisingly, neither of these conditions is necessary. The bulk of the 
technical contributions of this paper lie in getting rid of these assumptions. We 
show in particular how to construct a site clustering statistic similar to the one 
above directly from the data without prior knowledge of the tree. At a high level, 
all one needs is to select a large collection of "sufficiently correlated" pairs of 
leaves and then "dilute" them to discard pairs that are too close to each other. This 
leads to a highly concentrated site-wise statistic. See Section 2 for a statement of 
our results. 
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2 Definitions and Results 



2.1 Basic Definitions 

Phylogenies A phylogeny is a graphical representation of the speciation history 
of a group of organisms. The leaves typically correspond to current species. Each 
branching indicates a speciation event. Moreover we associate to each edge a 
positive weight. This weight can be thought roughly as the time elapsed on the 
edge multiplied by the mutation rate which may also depend on the edge. More 
formally: 

Definition 1 (Pliylogeny) A phylogeny T — {V, E; L, /x) is a tree with vertex set 
V, edge set E and n (labelled) leaves L — [n] = {1, . . . , n} such that 1) the 

degree of all internal vertices V — L is exactly 3, and 2) the edges are assigned 
weights n : E ^ (0, +oo). We let T[T] = (V, E; L) be the topology ofT. A 
phylogeny is naturally equipped with a so-called tree metric on the leaves d : 
L X L — > (0, +oo) defined as follows 



where Pathx(M, v) is the set of edges on the path between u and v in T. We will 
refer to d{u, v) as the evolutionary distance between u and v. Since under the 
assumptions above there is a one-to-one correspondence between d and n (see 
e.g. [SS03]), we write either T = {V,E;L,d) or T ^ {V, E; L, fi). We also 
sometimes use the natural extension ofdto the internal vertices ofT. 

We will sometimes restrict ourselves to the following standard special case. 

Definition 2 (Regular Piiylogenies) Let < f < g < +oo. We denote by Tf,g 
the set of phylogenies T — {V, E; L, p) such that \fe & E, f < fj,e < g. 

Poisson Model A standard model of DNA sequence evolution is the following 
Poisson model. See e.g. [SS03]. 

Definition 3 (Poisson Model) Consider the following stochastic process. We are 

given a phylogeny T = (V, E; [n], ji) and a finite set IZ with r elements. Let tt be 
a probability distribution on TZ. Let Q gW^^ be the following rate matrix 




e£PathT(u,D) 
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Associate to each edge e E E the stochastic matrix 




TT^j; + (1 - n^)e ifx = y, 
7ry{l — e~'^''), o.w. 



The process runs as follows. Choose an arbitrary root p E V. Denote by 
the set E directed away from the root. Pick a state for the root according to tt. 
Moving away from the root toward the leaves, apply the channel M(e) to each 
edge e independently. Denote the state so obtained uy — {a^y^y. In particular, 



For W C V, we denote by fiw ^he marginal of iiy at W. Under this model, 
the weight He is the expected number of substitutions on edge e in a related 
continuous-time process. The r-state Poisson model is the special case when tt 
is the uniform distribution over 71. In that case, we denote the distribution ofay 
by VlT, r]. When r is clear from the context, we write instead ay ~ T>\T]. 

More generally, we take k independent samples {ay)\^-y from the model above, 
that is, (Ty, . . . , (Ty are i.i.d. V\T, r]. We think of (o-*)^^^^ as the sequence at node 
V E V. Typically, IZ = {A, G, C,T} and the model describes how DNA se- 
quences stochastically evolve by point mutations along an evolutionary tree — 
under the assumption that each site in the sequences evolves independently. 

Example 1 (CFN and Jukes-Cantor Models) The special case r = 2 corre- 
sponds to the so-called CFN model. The special case r = 4 is the well-known 
Jukes -Cantor model. 

We fix r throughout. 

Remark 1 We discuss the more general GTR model in the concluding remarks. 

Rates-across-sites model We introduce the basic rates-across-sites model which 
will be the focus of this paper. We will use the following definition. 

Definition 4 (Phylogenetic Scaling) Let T = {V. E; [n], /i) be a phylogeny and 
A, a constant in [0, +oo). Then we denote by AT the phylogeny obtained by 
scaling the weights ofT by A, that is, AT = {V, E; [n], A/x). 




e=iu,v)eE^ 
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Definition 5 (Rates-Across-Sites Model (see e.g. [SSH94])) In the generalized Pois- 

son model we are given a phylogeny T and a scaling factor A, that is, a ran- 
dom variable on [0, +00). Let Ai, . . . , A/,, he i.i.d. copies of A. Conditioned on 
Ai, . . . , Ajfc, the samples {ay)^^^ generated under this model are independent with 
a-y ~ V[AjT], j — 1, . . . ,k. We denote by T>\T, A, r] the probability distribution 
ofay. We also let T>\T, A, r] be the probability distribution of(T\. 

2.2 Main results 

Tree identifiability To provide a uniform bound on the minimum tree size re- 
quired for our identifiability result to hold, we make explicit assumptions on the 
mutation model. For s > 0, let 

= E [e-^^] , 

be the moment generating function (or one-sided Laplace transform) of the scaling 
factor A. The probability distribution of A is determined by $. See e.g. [Bil95]. 
We normalize A so that 

-$'(0) = E [A] = 1. 
In particular, A is not identically and $ is continuous and strictly decreasing. 

Assumption 1 Let < f < g < +00, and M > 0. The following set of assump- 
tions on a generalized Poisson model will be denoted by A(/, M): 

L Regular phylogeny: The phylogeny T — (V, E; [n] , p) is in Tf,g. 

2. Mass close to 0: We have that 

$-1 < ^ 

In words, an evolutionary distance of M under A-scaling produces a correla- 
tion corresponding to an evolutionary distance of at least 6g without the scaling. 
We denote by GPM(/, g, M, uq) the set of generalized Poisson models satisfying 
A(/, g, M) with at least uq leaves. 

Remark 2 Note that the results in [SSH94] indicate that appropriate conditions 
are needed to obtain a tree identifiability result in the generalized Poisson model 
when the random scaling is unknown. We do not claim that the conditions above 
are minimal. The first assumption is meant to ensure that there is enough signal to 
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reconstruct the tree. The second assumption bounds the distortion of the random 

scaling for evolutionary distances corresponding to short paths. It essentially 
implies that the probability mass of h. close to is bounded. In particular, note 
that if the probability mass below 

is less than 5 (for 6 < e~^^), then 

$(M) < S+{1-S)e-'^ < e-^^ 

and the assumption is satisfied. Conversely, if A satisfies the second assumption 
then the probability mass S below e (for e < 6g/M) must be such that 

since 

> $(M) > 5e-'^. 
Remark 3 The second assumption implicitly implies that 

P[A = 0] < e"^^. 

This is in fact not necessary. By first removing all invariant sites, it should be 
possible to extend our main theorem to moment generating functions of the form 

where < o; < 1 uniformly bounded away from 1 and satisfies the assump- 
tions above. Indeed, on a large phylogeny, it is extremely unlikely to produce an 
invariant site using a positive scaling factor. Hence removing all invariant sites 
has the effect of essentially restricting the dataset to the positive part of the distri- 
bution of K. We leave the details to the reader Given this observation, in the rest 
of the manuscript one can assume that 

p[A = 0] = 0. 

Theorem 1 (Tree identifiability) Fix < f < g < +oo, and M > 0. Then, 
there exists no(/, g, M) > 1 such that, if{T, A) and (T', A') are in GPM(/, g, M, no) 
with r[T] r[T'], then 

V[T,A,r] ^V[r,A',r]. 
(Recall that T>\T, A, r] denotes the distribution at the leaves.) 
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Remark 4 Note that we allow A ~ A' (where ~ denotes equality in distribution). 
This is the sense in which our result is a tree identifiability result. 

Remark 5 Note that our identifiability result applies only to sufficiently large 
phylogenies. Computing uq from our techniques is difficult (and in general de- 
pends on the parameters f. g, M). One could estimate the required size by run- 
ning the reconstruction algorithm below on simulated data for various sizes and 
parameters. We leave such empirical studies for future work. 

The proof of our main theorem relies on the following reconstruction result. 

Tree reconstruction Moreover, we give a stronger result implying that the phy- 
logeny can be reconstructed with high confidence using polynomial length se- 
quences in polynomial time. The proof appears in Sections 3, 4 and 5. 

Theorem 2 (Tree Reconstruction) Under Assumption 1, for all < 5 < 1, there 
is a 7fe > large enough so that the topology of the tree can be reconstructed in 
polynomial time using k — n"' samples, except with probability S. 

Remark 6 Once the tree has been estimated, one can also infer the rate distribu- 
tion. Details are left to the interested reader. 

3 Site clustering statistic: Existence and properties 

In this section, we introduce our main site clustering statistic and show that it is 
concentrated. Let < f < g < +oo and M > 0. In this section, we fix a 
phylogeny T = {V,E; [n],/x) in 7},^. We let (cr^)i=i be k i.i.d. samples from 
©[T, A, r] where the generalized Poisson model (T, A) satisfies Assumption 1. 
Moreover, let Ai, . . . , A^ be the i.i.d. scaling factors corresponding to the k sam- 
ples above. 

Some notation We will also use the notation [n]^ = {{a,b) G [n]x[n] : a<b}, 
[n\t = {(a, a)}ae[n], and [n]^ = [n]^ — [n]?,. We also denote by [n]^ the set of 
pairs (a, 6), (c, d) G [n]^ such that (a, b) ^ (c, d) (as pairs). For o; > 0, we let 

Tq, = {{a, b) e [n]^ : d(a, b) < a}, 

be all pairs of leaves in T at evolutionary distance at most a. Let Poo — ^ — Qoo 
where ?oo = Exe7j7r^- 
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3.1 What makes a good site clustering statistic? 

For a site i — 1, . . . , /c, consider a statistic of the form 

H p-MiK = <^a-?oo], (1) 



{a,b)er 



where T C [n]^, is a subset of pairs of distinct leaves independent of i. Using the 
expression for the transition matrix given in Definition 3, note that 

m I = M E p~ M^i< - I - 



(a,b)eT 



— y 



{a,b)er 



5]7r47r. + (l-7r.)e-W))_^^ 



(a,6)eT 

which is strictly decreasing in Aj. We need two properties for (1) to make a good 
site clustering statistic: separation and concentration. 

For separation, that is, for the statistic above to distinguish different scaling 
factors as much as possible, we require the following condition: 

SI Each pair in T is composed of two "sufficiently close" leaves, that is, there 
is a < +00 such that T C Tq,. 

Indeed, if two leaves are far away, their joint distribution is close to independent 
and scaling has little effect on their agreement. A much better separation is ob- 
tained from close leaves. 

To guarantee concentration of a statistic of the type (1), we require the follow- 
ing three conditions on T: 

CI The set T is "large enough" and each pair makes a "small contribution" to 
the sum. This will be satisfied if we show that we can take |T| = 6(n), as 
(1) is a sum of {0, l}-variables. 

C2 Agreement for different pairs in T is "sufficiently uncorrected," e.g., inde- 
pendent. 

These conditions will allow us to apply standard large deviations arguments. 
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Example 2 (Full sum) As a first guess, one may expect that taking T to be all 

pairs of leaves may give a good site clustering statistic. However, in general 
this is not the case as we show in the following example. Consider the two-state 
case on a complete binary tree with identical edge lengths /i and A = 1. For 
mathematical convenience, assume that the states are TZ — {+1,-1} and let 

Then, up to a multiplicative factor and additive constant, the clustering statistic 
is simply 

(a,6)eN^ 

for a tree with h levels. Using a calculation of [EKPSOO, Section 5], one has 

E[a„(7;,] = e-'^^"'"). (2) 

Dividing the expectation into terms over the first subtree of the root, terms over 
the second subtree of the root, and terms between the two subtrees, we have 

Solving for the recursion gives 

E = 72^-2^^ = O {2'^^''^) , 

as /i — > oo. On the other hand, the expectation of the square E[(W^'*^)^] is a sum 
of terms of the form 'E^az^CFz^Uz^CFz^ where some of the z's may be repeated. All 
such terms are non-negative because of (2) and the fact that terms where all z's 
are different factor into a product by Proposition 1 below. Hence 

(a,b)e[n\l 

^ 2^(2^-1) „,2o2.-4|^7^-iy 

ifj< 1. Hence, assuming that j < 1, we have 

E[WW]' 

Var [WW] ^ °' 

as h ^ oo. In other words, the sum over all pairs is too "noisy" to serve as a site 
clustering statistic in that case. 
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3.2 Does it exist? 



We now show that there always exist statistics that satisfy the properties above and 
we give explicit guarantee on their concentration. Note that, in the current section, 
we only provide a proof of existence. In particular, in establishing existence, 
we use evolutionary distances which are not available from the data. Later, in 
Section 4, we explain how to construct such a statistic from the data ©[T, A,r] 
(or, more precisely, the samples {<j}^)i=i) without knowledge of the tree topology, 
evolutionary distances, or site scaling factors. 

We now explain how the conditions above can be achieved for an appropriate 
choice of T on any tree topology. Note, however, that T depends on T. 

Independence We first show that the clustering statistic (1) is a sum of indepen- 
dent variables as long as the paths between different pairs do not intersect. This 
will allow us to satisfy C2. 

Proposition 1 (Independence) Assume that for all (a, 6), (a', h') e Twith (a, b) ^ 
{a',b')we have 

Path(a, b) n Path(a', b') = 0, 

where Path (a, b) is the set of edges on the path between a and b. Then the random 
variables {l{a"a — crb}} (^a b)er' mutually independent. 

Proof: Denote T = {(oi, &i), ■ ■ ■ , {o-v, K)} with v — |T|. Let V be the set of 
nodes on the path between oi and bi. Removing the edges in Path(ai, bi) creates 
a forest where the V-nodes can be taken as roots. Note that, by symmetry and the 
Markov property, conditioned on V, the distribution of the random variables 

{l{(7a = (Jb}}^a,b)er-{{auH)} ' 0) 

does not depend on the states of the V-nodes. In particular, l{(Ja^ = cxb^} is 
independent of (3). Proceeding by induction gives the result. ■ 

Size To satisfy SI, we restrict ourselves to "close pairs." We first show that the 
size of Tq, grows linearly as long as a > Ag, allowing us to also satisfy CI. A 
similar result is proved in [SS06]. 

Proposition 2 (Size of T^) Let a > Ag. Then 

IT |>- 



12 



Proof: Let 

r = {a e [n] : d{a, b) > a, e [n] - {a}}, 

that is, r is the set of leaves with no other leaf at evolutionary distance a. We will 
bound the size of F. For a e F, let 

B{a) = e y : d{a,v) < |} . 

Note that for all a.b E T with a ^ b we have B{a) fl B{b) = by the triangle 
inequality. Moreover, it holds that for all a G F 

\B{a)\ > 2LiJ, 

since T is binary and there is no leaf other than a in B(a). Hence, we must have 

2n-2 / 1 \ 

' 2LtJ " V2LSJ-V 

as there are 2n — 2 nodes in T. 

Now, for all a ^ F assign an arbitrary leaf at evolutionary distance at most a. 
Then 

|T„| > ^(n-|r|) 

where we divided by 2 to avoid over-counting. The result follows from the as- 
sumption a > Ag.M 

Sparsification Note that T^g satisfies CI but does not satisfy C2 as the pairs 
may be intersecting (see Proposition 1). We now show how to satisfy both CI and 
C2 by "sparsifying" T4g. In stating this procedure, we allow some flexibility (that 
is, arbitrary choices) which will be useful in analyzing the actual implementation 
in the next section. Let Ag < m < M he a constant to be determined later and 
assume T' is any set satisfying 

T4, C T' C 

We know from Proposition 2 that T' has linear size, that is, |T'| > n/4. We 
construct a linear-sized subset T of T' satisfying the non-intersection condition of 
Proposition 1 as follows. Let S :— T' and T" := 0. 



13 



• Take any pair (a*, b*) in S and add it to T". 

• Let So be any subset of S such that 5*0 contains all pairs with at least one 
node within evolutionary distance m of either a* or b* and contains no pair 
with both nodes beyond evolutionary distance M from both a* and b*. Re- 
move 5*0 from S. 

• Repeat until S is empty. 

• Return T := T". 

We claim that T is linear in size and that no two pairs in T intersect. 

Proposition 3 (Properties of T) Let T be any set built by the procedure above. 
Then, 

1. For all (a, 6), (a', 6') e T with (a, 6) 7^ {a',b') we have Path(a,6) fl 
Path(a', b') = 0. 

2. There is 7^ = 7s (M, /) > such that |T| > 7^71, where 7^ does not depend 
on T, but only on M, f. 

3. For all (a, 6) G T, we have 

2/ < d{a, b) < M. 

Proof: We first prove the non-intersecting condition. All pairs of leaves in T are 
at evolutionary distance at most m. Moreover, for any (a, b) ^ (a', b') in T, we 
have by construction 

mm{d{u,v) : u G {a,6},f G {a', 6'}} > m. 

Hence, the path between a and b and the path between a' and b' cannot intersect: 
we have 

d{a, b) + d{a\ b') - d{a, a') - d{b, b') < 0, 

which, using > for all e and the four-point test (see e.g. [SS03]), excludes 
the topology aa'\bb' (that is, the four-leaf topology where {a, a'} is one side of the 
internal edge and {6, b'} is on the other); and similarly for the topology abl\a'b. 
We now bound the size of T. Let (a, b) be a pair of leaves at evolutionary 

distance at most m. There are at most 2-2L/J =2L/J leaves at evolu- 
tionary distance at most M from either a or b. Therefore, at each iteration of 
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the sparsification algorithm, the number of elements of S removed is at most 

2L/J"^L/J <2 L/J,as each leaf removed is involved in at most 2 L / J pairs 
at evolutionary distance m. Since by Proposition 2, the size of T' is at least n/4, 
the number of elements in T at the end of the sparsification algorithm is at least 

|T|> ^ 

Finally, by our assumption on the phylogeny, two distinct leaves are always at 
evolutionary distance at least 2/. For the upper bound, use m < M.M 

Define ^ 

7, = 7,(M,/) = I ^, ■ 
2^LtJ+^ 

Definition 6 (Sparse pairs) We say that a setT C [n]^ is 7s-sparse if it satisfies 
the three properties in the statement of Proposition 3. 



3.3 Properties of the site clustering statistic 

In (1) fix an 7^ -sparse T. We now show that conditions CI and C2 lead to con- 
centration. Let 

' ' (a,6)eT ' ' (a,&)eT 

Moreover, for A > 0, define 

' ' (o,6)eT 

and note that 

[/T(Ai) = E[Wi|Ai], 

and 

C/t = E[[/t(A.)], 

for z = 1, . . . , k. 

Proposition 4 (Concentration of Ui) For all C > 0, there is c> Q depending on 
M, f such that 

P[|W, - f/T(A.)| > CI A.] < 2exp(-cC'n), 
almost surely, for all i — 1, . . . , k. (We will eventually use ( — o{l).) 
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Proof: Recall the following standard concentration inequality (see e.g. [MR95]): 

Lemma 1 (Azuma-Hoeffding Inequality) Suppose X = (Xi, . . . , X^) are in- 
dependent random variables taking values in a set S, and f : — > R is any 
t-Lipschitz function: |/(x) — /(y) | < t whenever x and y differ at just one coor- 
dinate. Then, > 0, 

P[|/(X)-E[/(X)]|>C]<2exp 



2t^m 

From Propositions 1, 2, and 3, the random variable Ui is a (normalized) sum of 
Q{ri) independent bounded variables. By Lemma 1, conditioning on Aj, we have 
|C/x(Ai) — Ui\ < C except with probabiUty exp(— Q(C^n)), where we used that 
m = n{n) and t = 0{l/n). ■ 

Moreover, we show separation. 

Proposition 5 (Separation otUi) If X — X' > where /3 > 0, then 

C/t(AO-C/t(A) >e-^^ (e2//3_i). 

Proof: We have 

1 

(a,6)eT 



Ur{X')-Ur{X) = ^ ^ Jg-A'd(a,;,) _ g-A<i(a,6)j 

' ' (a,6)eT 

> J_ ^ ^g-(A-/3)d(a,6) _ g-Ad(a,fe)j 

(a,b)er 

(a,6)eT 



' ' (a,6)GT 

since 2/ < d{a, h) < M for all (a, 6) e T by assumption. ■ 

Remark 7 The last bound may seem problematic because under our assumptions 
the scaling factor is allowed to have an unbounded support. In that case the RHS 
could be arbitrarily close to 0. But we will show below that we can safely ignore 
large values ofX. 
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4 Constructing the site clustering statistic from data 



Note that in the previous section we only established the existence of an appropri- 
ate site clustering statistic. We now show how such a statistic can be built from 
data without knowledge of the tree topology or site scaling factors. 



Notation We use the same notation as in the previous section. Further, we let 

1 ^ 



1=1 



Also let 



q{a,h) = E[g(a,6)] 

= E [E [p-i [IK = <7,i} - goo] |Ai]] 
= E [e-^i'i(«-6)] 

= $(d(a,6)). 

where we used our previous calculations. We define some constants used in the 
algorithm and its analysis whose values will be justified below. Let 

Also recall that $ is strictly decreasing and let 

Note that by Jensen's inequality and E[Ai] = 1 

$(5^) > e-^[^il-^» = e-^^ 

so that 

m > 5g > Ag, 

as assumed in the previous section. Similarly, by assumption, 

m = ^'\5g) < ^-\e-^^) < M, 
so that m < M. Finally let 

77 = min je"^^ - e'^'^^ e"^'^^ - 6"^^ e"^^ - 6"^-^^ e"^'^^ - e"^^} . 
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4.1 Site clustering algorithm 

We proceed in three steps, as in the ideaUzed setting of Section 3.2. However, 
unUke the ideaUzed setting, we do not assume the knowledge of evolutionary 
distances. Note in particular that it is not possible to estimate d{a, h) from the 
samples (crj^)^^^ because the rate distribution is unknown. Instead, we use g(a, h) 
as a rough estimate of how close a and 6 are in the tree. This will suffice for our 
purposes, as we show in the next subsection. The algorithm is the following: 

1. ( Close Pairs) For all pairs of leaves a, 6 e [n], compute q{a,h) and set 

r = {(a,6)e[n]J : ?>,6)>a;-}. 

2. (Sparsification) Let S := T' and T" := 0. 

• Take any pair {a*,b*)mS and add it to T". 

• Remove from S all pairs (a, b) such that 

max{g(c*, c) : c* e {a*, b*}, c e {a, b}} > oo^. 

• Repeat until S is empty. 

3. (Final Statistic) Return T := T". 

4.2 Analysis of the clustering algorithm 

Let T be the set returned by the previous algorithm. We show that it is 7s-sparse 
with high probability. 

Proposition 6 (Clustering statistic) Under Assumption 1, for all < 5 < 1 

there exists a constant < C < +oo (depending on g) such that the set of pairs 
T returned by the previous algorithm is ^yg-sparse with probability 1 — 6 provided 
that the number of samples satisfies 

k > Clogn. 

Moreover, the algorithm runs in polynomial time. 

Proof: We first prove that all q{a, bys are sufficiently accurate. 
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Lemma 2 For all < 5 < 1, there exists a constant < C < +oo (depending 
on g) such that 

\q{a,h) - g(a,6)| < r], 

for all (a, h) e [n]^ with probability 1 — 5 provided that the number of samples 
satisfies 

k > Clogn. 

Proof: For each (a, b) e [n]^, g(a, b) is a sum of k independent bounded vari- 
ables. By Lemma 1, taking ( — rjwe have 

\q{a,b) -q{a,b)\ < r], 

except with probability 2exp(— C'/c) for some C > depending on poo and rj. 
Note that there are at most elements in [n]^ so that the probability of failure is 
at most 

2n^exp(-C"- Clogn) < S, 

for C sufficiently large. ■ 

We return to the proof of Proposition 6. Assume that the conclusion of the pre- 
vious lemma holds. Our goal is to prove that the site clustering algorithm then 
follows the idealized sparsification procedure described in Section 3. 

1. We first prove that the set 

r = {(a,6)e[n]J : q(a,b)>uj-}. 

satisfies 

T43 C T' C Tm. 
Let (a, b) be such that d{a, b) < Ag. Then 

g(a,6) = $(d(a,6))>$(4^)>e-^^ 
by monotonicity and Jensen's inequality. Hence 

q{a, b) > e-^^ - 7] > e"^^ - (e"^^ - e"^'^^) > e'^-^^ = tu" , 
and Tig C T'. 

Similarly, let (a, b) be such that d{a, b) > m. Then 

q{a, b) = ^{d{a, b)) < *(m) = u^, 

and 

q{a, b)<cum + v< e-'^ + (e"^'^^ - 6"^^) = uj-, 
so that T' C Tm. 
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2. Let T" be the set obtained during one of the iterations of Step 2 of the site 
clustering algorithm and fix a pair (a*, b*) e T". We need to show that the 
set 5*0 of pairs (a, 6) in T" such that 

max{g(c*, c) : c* e {a*, b*}, c e {a, b}} > a;+ (4) 

is such that it contains all pairs with at least one node within evolutionary 
distance m of either a* or b* and contains no pair with both nodes beyond 
evolutionary distance M from both a* and b*. In the first case, assimie 
w.l.o.g. that 

d{a, a*) < m. 

Then, arguing as above, 

g(a,a*)>e-5^-(e-^^-e-^-^^)=a;+, 
and (4) is satisfied. In the second case, for all c e {a, b] and c* e {a*, b*} 

d{c, c*) > M, 

and 

so that (4) is not satisfied. 

The two properties above guarantee that the algorithm constructs a set T as in the 
idealized sparsification procedure of Section 3. In particular, T is 7s-sparse by 
Proposition 3. ■ 

5 Tree reconstruction 

We now show how to use our site clustering statistic to build the tree itself. The 
algorithm is composed of two steps: we first "bin" the sites according to the value 
of the clustering statistic; we then use the sites in one of those bins and apply 
a standard distance-based reconstruction method. By taking the bins sufficiently 
small, we show that the content of the bins is made of sites with almost identical 
scaling factor — ^thus essentially reducing the situation to the unmixed case. 

Throughout this section, we assume that there is a 7fc > such that k — rf"' . 
We also assume that T is 7s-sparse, that U stands for a copy of the corresponding 
clustering statistic under scaling factor A, and that Assumption 1 is satisfied. 
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5.1 Site binning 

Ignoring small and large scaling factors We first show that, under Assump- 
tion 1, the scaUng factor has non-negligible mass between two bounded values. 

Proposition 7 (Bounding the scaling factor) We have 

P [A < A < A] > X, 

where 



A- ^ 



A = 



1 - e-5fl ' 
and 

_ 1 - e-5» 
X - ^ ■ 

Proof: From our convention that E[A] = 1, Markov's inequality implies that 

1 1 - 



P[A>A1<^ 2 

For the other direction, we reproduce the argument in Remark 2. Recall that 
we assume that 

$-1 (e-6») < M. 

Then the probability mass 5 below e (for e < 6g/M) must be such that 

since 

e~^^ > $(M) > 5e-'^. 

Take e = g/Mso that 5 < e'^^. 
Then we have 

p [A < A < A] > 1 - (e-^^) - =x, 

as desired. ■ 

Translating the previous proposition into a statement about W-values, we obtain 
the following. 
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Proposition 8 (Bounding W- values) Letting x be as above, we have 



¥[U< Ur{A) <U]>x, 

where 
and 

Proof: Recall that 

Ur{A) = E[W I A] = J2 e-^'^^'^'^'l 



(a,6)eT 



Since 

2/ < d{a, h)<M 

for all (a, h) e T, we have 



g-MA < J_ ^ g-Ad(a,6) ^ g-2/A_ 



' ' (a,6)GT 

The result then follows from Proposition 7. ■ 

Binning the sites We will now bin the sites whose clustering statistic lie be- 
tween IJ_ and U. The previous proposition guarantees that there is a positive frac- 
tion of such sites in expectation. Let 

A 



logn' 



be the size of the bins in W-space, where 7;/ > is a constant to be fixed later. To 
avoid taking integer parts, we assume for simplicity that C/ — C/ is a multiple of 
A[/. Let _ 

U-U + 2Au 



u 



be the number of bins. (The extra 2Au in the numerator accounts for estimation 
error. See below.) Note that U — U_ = Q{1) and therefore Nu = 6 (logn). We 
proceed as follows: 
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• (Initialization) For j — 0,..., Nu, 

- B, = 0. 

• (Main Loop) For i — 1, . . . ,k, 

- (Out-of-bounds) If Ui ^[U-/^U:U + Au) then Bo := Bo U {i}. 

- (Binning) Else if 

U,e[U-Au + U - l)Au,U -Au+ jAu) 
then Bj := Bj U {i} and B>o := -B>o U {i}. 
Restating Proposition 4, we have: 
Proposition 9 (Concentration of W-values) We have 

H-Ur{K)\<Au, yie{l,...,k}, (5) 
except with probability exp (— log^ n)). 

Proof: Taking ( = Au in Proposition 4, (5) holds except with probability 

2n"' exp (— log^ n)) = exp (— log^ n)) . 



We first show that each bin contains sites with roughly the same scaling factor. We 
first need a bound on the scaling factors in B^q. (Note that, because we needed 
that the bounds U_ and U be independent of T (which itself depends on unknown 
evolutionary distances). Proposition 7 does not apply directly here.) 

Proposition 10 (Bounds on selected scaling factors) Assume (5) holds. There 
is 7[/ > small enough so that for all i e B^o 

fg . 2M 



M2 - /(I - e-5fl) 
Proof: Since Ur{K) <U + 2Au by (5), we have 

U + 2Au > ^ J] 



ITI 

' ' (a,6)eT 

> e . 



^-Aid{a,b) 
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Choose 7C7 > small enough, so that 

U + 2Au < e-f^. 

Taking logarithms, 

A >^ 

Similarly, since Ur{Ai) >U_ — 2Aj/ by (5), we have 

' ' (a,6)eT 

Choose 7{/ > small enough, so that 

U-2Au> e-'^^. 

Taking logarithms, 

2M 

A, < 



■ 

For J e {l,...,Nu},\Qt 

Uj - Au + + 1/2) Au, 

be the midpoint of the j-th bin. Using the fact that Ur{X) is strictly decreasing in 
A e R+, we define Xj as the unique solution to 

fori e{l,...,Art;}. 

Proposition 11 (Bin variation) Assume (5) holds. For any 7a > 0, one can pick 
7c/ > small enough (depending on M, f,g) such that for any j e {1, . . . , Nu} 
andi e Bj, 

|A,;-A,|< 



logn 
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Proof: This follows from Proposition 5. Assume that Uj > Ur{A.i). (The other 
case is similar.) Using the upper bound in Proposition 10 and (5), we get 

3 
2 

Hence, 



> U, - UriK) > e-'- (e^/^ - l) > (e^/^ - l) exp {'J^^^) • 



1 ^ / 37c7exp(|yp?^ 



2/ " I 21ogn 



Next, we argue that at least one bin contains a non-negligible fraction of sites. We 
say that a bin Bj is abundant if 

X 



> k- 



Proposition 12 (Abundant bin) We have 

3j* e {1, . . . , Nu} such that Bj* is abundant, (6) 
except with probability exp{—fl{n'^'')) for some js > 0. 

Proof: For the analysis, we introduce ^cfj?/oM5 bins for the (unknown) expected 
W-values. That is, for i = 1, . . . , fc, we let i e Bj if 

Ur{A^) G [U-^Au + {j-l)Au,U-Au+jAu), 

for some j G {2, . . . , Nu — 1}, or i e Bq otherwise. 
Then, there is j** e {2, . . . , Nu - 1} such that 

F[Ur{A) e [U-Au + {r-l)Au,U-Au+rAu)] > (7) 

that is, the probability that a site falls into bin Bj** is at least xl^u- This follows 
immediately from Proposition 8 and the fact that the bins are disjoint and cover 
the interval [f/, U] . 

From Lemma 1 and (7), 



P 



I * * I ^ k~ 



2N 



u 



< 2exp ( -^M^^^ ) = exp (-Q(n^VV^)) ■ 



Therefore, if (5) holds, one of Bj**_i, Bj**, or Bj** j^i must contain at least a third 
of the sites in Bj**. This occurs with probability at least 1 — exp(— 0(n''''')) for 
some 75 > 0. ■ 
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5.2 Estimating a distorted metric 

Estimating evolutionary distances We use an abundant bin to estimate evolu- 
tionary distances. 

• (Abundant bin) Let B* be any bin with at least k-^^ sites and set 

k* = B* 
and 

A* = \j, 

where j is the index of B*, that is. A* is the midpoint of B*. 

• (Evolutionary distances) For all a b e L, compute 

ieB* 

We prove that the q*{a, h) is a good approximation of e~^*'^^°''^\ 

Proposition 13 (Accuracy) Let 7a > 0, 7^ < 7fc/2 be fixed constants. There is a 
75 > such that the following hold except with probability exp(— ^2(n''''')).• 
i. There is at least one abundant bin. Let B* be an arbitrary such bin. 
2. And, for each i & B* and for all a ^ b E L, 

\q*ia, b) - e-^'-^^"'^) I < — + e-^*'^^"'^) (e^^ - l) . (8) 

Proof: The result follows from Propositions 9, 10, 11 and 12, and the following 
lemma. 

Lemma 3 Let^^ > 0, 7^ < 7jt/2 be fixed constants. Let 
and 

fg ~ ~ ~ 2M 
< A, Ai, . . . , A^ < 
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be such that, for all i, 



7a 



logn 



\-\ < 

Let a\ ~ T^\T, Aj, r] independently for all i. Then, for all a ^ h & L, 

k 



(9) 



k 



1=1 



< J_ + g-Ad(a,6) / 



7Arf(o.fe) 
e logn — 1 



except with probability exp ( — Q ^'"'^ / log n) ) . 



Proof: In Lemma 1, take m = k = il{ri^'' /logn), t = and ( = Then, 
except with probability exp (— / log n)), 



1=1 



< 



for all a ^ b e L. Moreover, by (9), 



-Xdia,b) _ i g-Aid(a,6) 
i=l 



< g-A<i(a,6) 



JA-Ai|(i(a,6) 



i=l 



1 — e 



Tree construction To reconstruct the tree, we use a distance-based method 
of [DMR09]. We require the following definition. 

Definition 7 (Distorted metric [Mos07, KZZ03]) LetT ^ {V, E; L, d) be a phy- 
logeny and let r, > 0. We say that d : L x L ^ {0, +oo] is a (r, -distorted 
metric /or T or a (r, ^) -distortion ofd if: 

1. [Symmetry] For all u,v & L, d is symmetric, that is, 

d{u, v) = d{v, u); 
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2. [Distortion] d is accurate on "short" distances, that is, for all u,v & L, if 
either d{u, v) < "if + t or d{u, v) < ^ + r then 



d{u, v) — d{u, v] 



< T. 



An immediate consequence of [DMR09, Theorem 1] is the following. 

Theorem 3 (See [DMR09].) Let T = (V, E\ L, d) be a phytogeny with n leaves 
in Tf^g. Then topology ofT can be recovered in polynomial time from a (r, ^)- 
distortion d of das long as 

f 



r < 



5' 



and 



^ > 5glogn. 

(The constants above are not optimal but will suffice for our purposes.) 

See [DMR09] for the details of the reconstruction algorithm. 

We now show how to obtain a{f/5,5g log n) -distortion with high probability. 

Proposition 14 (Distortion estimation) There are '^u-,l^-,lq-,lk > 5o that, given 
that the conclusions of Proposition 13 hold, then 



d{a,b) = - ln{q*{a,b)+) , 
is a (A*//5, 5\*g log n)-distortion of\*d. 
Proof: Define 



(a, b) e L X L, 



and 



{(a, 6) e L X L : d{a,b) < Ibglogn}, 



{{a,b) e L X L : d{a,b) > 1251 log n}, 



Let (a, 6) e L2 . Note that 



2M 



-59 



15 g logn 



1 

i7o 



y' ' 

n'1 



where the last equality is a definition. Then, taking 7^ (and hence jk) large enough 
and 7a (and hence ^u) small enough, from (8) we have 



d{a, b) — X*d{a, b) 



5 - 5 
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Similarly, let (a, b) E Lj. Note that 

e-*''^-'') < exp (- 12,logn)^-^, 

where the last equality is a definition. Then, taking 7^ large enough and 7a small 
enough, from (8) we have 

A* / 

d{a, b) > 6A*5flogn > b\*g\ogn H — —. 



We finally state our main tree-construction result. 

Proposition 15 (Tree reconstruction) Under Assumption 1, given a jg-sparse T 
there is a 7^ > large enough so that the topology of the tree can be reconstructed 
in polynomial time using k — n"^ samples, except with probability ex.]i{—VL{n^^)) 
for some 75 > 0. 

Proof: The result follows from Theorem 3 and Proposition 14. ■ 
Combining Propositions 6 and 15, we get Theorem 2. 

6 Concluding remarks 

Using techniques from the recent unpublished manuscript [MRU], our results can 
be extended to handle the more general GTR model of molecular evolution which 
allows Q-matrices to be time-reversible. This generalization involves choosing 
pairs of leaves that are not only connected by edge-disjoint paths, but also far 
enough from each other. One can then use mixing arguments to derive the inde- 
pendence properties required for concentration of the site clustering statistic. We 
leave out the details. 
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